

AD-A238  850 


ANALYSIS  OF  THE  ORGANIZATION  OF  LEXICAL  MEMORY 


George  A.  Miller 
Cognitive  Science  Laboratory 
Department  of  Psychology 
Princeton  University 


Final  Report 
30  June  1991 




This report was prepared under the Navy Manpower, Personnel, and Training R&D Program
of the Office of the Chief of Naval Research under Contract N00014-86-K-0492, with
contributions to the contract from the Navy Personnel Research and Development Center and
from the Office of Naval Research Cognitive Science Program. The research was also supported
in part by a contract with the Army Research Institute and a grant from the James S. McDonnell
Foundation. Reproduction in whole or in part is permitted for any purpose of the United States
Government. Approved for public release; distribution unlimited.


91-06498 






REPORT DOCUMENTATION PAGE

Form Approved
OMB No. 0704-0188

1a. Report Security Classification: Unclassified
1b. Restrictive Markings:
2a. Security Classification Authority:
2b. Declassification/Downgrading Schedule:
3. Distribution/Availability of Report: Approved for public release; distribution unlimited
4. Performing Organization Report Number(s):
5. Monitoring Organization Report Number(s):
6a. Name of Performing Organization: Princeton University
6b. Office Symbol (if applicable):
6c. Address (City, State, and ZIP Code): Princeton, NJ 08544-1010
7a. Name of Monitoring Organization: Cognitive Science Program, Office of Naval Research (Code 11A2CS)
7b. Address (City, State, and ZIP Code): 800 North Quincy Street, Arlington, VA 22217-5000
8a. Name of Funding/Sponsoring Organization:
8b. Office Symbol (if applicable): 222
8c. Address (City, State, and ZIP Code):
9. Procurement Instrument Identification Number: N00014-90-J-1692
10. Source of Funding Numbers (Program Element No., Project No., Task No.): RM33M20, RR0420-0C
11. Title (Include Security Classification): Analysis of the Organization of Lexical Memory (Unclassified)
12. Personal Author(s): George A. Miller
13a. Type of Report: Final
14. Date of Report (Year, Month, Day): 1991, July 10
15. Page Count: 6
16. Supplementary Notation: Supported by the Office of the Chief of Naval Research Manpower, Personnel, and Training R&D Program
17. COSATI Codes (Group, Sub-Group):
18. Subject Terms (Continue on reverse if necessary and identify by block number): Lexical database, natural language processing, lexicography
19. Abstract (Continue on reverse if necessary and identify by block number)

The practical outcome of the project, "Analysis of the Organization of Lexical Memory,"
is an electronic lexical database called WordNet that can be incorporated into computer
systems for processing English text. WordNet includes approximately 45,000 lexicalized
concepts, providing a coverage equivalent to a handheld dictionary. The database has three
components, one each for nouns, verbs, and adjectives. The semantic relations that organize
each component are different, but in general a lexicalized concept is represented by a set of
synonyms that can be used to express the concept, and familiar semantic relations are
represented by labeled pointers between synonym sets. In order to create the database,
programs were written to write and edit lexical files, to convert lexical files into a database,
to search the database, to strip inflections from search requests, and to display retrieved
information for a user.

20. Distribution/Availability of Abstract: [x] Unclassified/Unlimited  [ ] Same as Rpt.  [ ] DTIC Users
21. Abstract Security Classification: Unclassified
22a. Name of Responsible Individual: Dr. Susan Chipman

DD Form 1473, JUN 86. Previous editions are obsolete.
S/N 0102-LF-014-6603

SECURITY CLASSIFICATION OF THIS PAGE


19. Abstract continued

Three user interfaces have been developed for WordNet. (1) The simplest is a
command-line version that does not require a windowing system and can run on standard
monitors. (2) A browser written for SunView and for X-11 windows is intended for use
with an on-line dictionary; by using WordNet, the dictionary can be searched conceptually
as well as alphabetically. (3) A lexical filter written for X-11 windows catches unfamiliar
words in a text file and suggests alternative expressions that an author may wish to choose.


DD Form 1473, JUN 86 (Reverse)


ANALYSIS  OF  THE  ORGANIZATION  OF  LEXICAL  MEMORY 


Abstract 

The practical outcome of the project, “Analysis of the Organization of Lexical Memory,”
is an electronic lexical database called WordNet that can be incorporated into computer
systems for processing English text. WordNet includes approximately 45,000 lexicalized
concepts, providing a coverage equivalent to a handheld dictionary. The database has three
components, one each for nouns, verbs, and adjectives. The semantic relations that organize
each component are different, but in general a lexicalized concept is represented by a set of
synonyms that can be used to express the concept, and familiar semantic relations are
represented by labeled pointers between synonym sets. In order to create the database,
programs were written to write and edit lexical files, to convert lexical files into a database,
to search the database, to strip inflections from search requests, and to display retrieved
information for a user.

Three user interfaces have been developed for WordNet. (1) The simplest is a
command-line version that does not require a windowing system and can run on standard
monitors. (2) A browser written for SunView and for X-11 windows is intended for use with
an on-line dictionary; by using WordNet, the dictionary can be searched conceptually as well
as alphabetically. (3) A lexical filter written for X-11 windows catches unfamiliar words in
a text file and suggests alternative expressions that an author may wish to choose.


Background 

The on-line database now known as WordNet began as an
experiment designed to test whether certain psycholinguistic
claims could be extended to cover the entire lexical core of
English. The claims in question hold that the organization of
lexical memory can be represented as a network of labeled nodes
(for lexicalized concepts) connected by labeled arcs (for
semantic relations between concepts). These claims, which can
be referred to generically as the relational hypothesis, were
stated in the psycholinguistic literature in very general terms,
but were usually illustrated with only a handful of carefully
chosen lexical items. Moreover, this relational hypothesis
contrasted with other psycholinguistic claims, which can be
referred to generically as the componential hypothesis, to the
effect that the organization of lexical memory is best
represented by analysis into semantic components, rather than
into semantic relations. Fundamental questions about the theory
of lexical knowledge (such as how much of the descriptive load
can be carried by relations and how much by components) were
unanswered. In order to pursue such questions, therefore, it was
decided to push the relational approach as far as it would go,
applying it to the lexicon of English, to see where it fails and
to discover what kinds of lexical knowledge require more
sophisticated analysis.


The experiment can be counted a success, although a relational
characterization of lexical memory for all of English could not
be implemented as directly as had been anticipated at the
beginning; a number of unexpected problems had to be resolved
in order to carry it through. An initial decision was made to
limit the experiment to semantic relations between open class
words; closed class words (prepositions, pronouns, conjunctions,
articles, etc.) are better characterized by their syntactic
properties and relations, and for practical applications in
natural language processing the closed class words should be an
integral part of the parsing program. But even for open class
words there are differences between parts of speech that a
relational representation must respect: for nouns, the relation
of class inclusion is most important; for verbs, a complex set
of entailment relations is required; and modifiers are best
characterized in terms of oppositions. Consequently, discovering
what semantic relations to use required three concurrent and
related investigations, and resulted in three relatively
independent networks: one each for nouns, verbs, and adjectives.

Semantic  Relations 

What terms should a semantic relation relate? A basic
assumption here is that a distinction must be drawn between two
common senses of the word “word,” between words as concrete
forms (strings of ASCII characters in this instance) and words
as abstract concepts that the forms can be used to express.
Since computers see character strings where people see
concepts, an important goal of this work was to give computers
something that could be processed as people process concepts.
The initial assumption, therefore, was that semantic relations
should be relations between lexicalized concepts.

A wide variety of semantic relations has been described in the
technical literature, but few were deemed suitable for this
research. The criteria for adoption are simple: (1) Since the
basic conception is that of a network, binary (two-term)
semantic relations were presupposed. (2) Since broad coverage
of the lexicon is a prime consideration, semantic relations
with a narrow range of application are neglected (the relation
“ancestor of,” for example, applies only between kin terms).
(3) Since the network is intended for users without special
training in linguistics, semantic relations must be intuitively
obvious to laypersons. (4) Since workers creating the database
are necessarily dependent on standard lexicographic references,
semantic relations that are regularly coded in dictionaries and
thesauruses are preferred. (5) Since exploration of the network
in any direction is desired, only semantic relations that have
an obvious reciprocal relation are adopted. A number of
semantic relations survived these criteria.

The attempt to limit WordNet to semantic relations between
lexicalized concepts failed; in particular, synonymy and
antonymy, two basic semantic relations, hold between lexical
forms. The other semantic relations, however, are relations
between lexicalized concepts.

Synonymy:  Two  word  forms  are  synonyms  if 
there  are  linguistic  contexts  in  which  one  can  be 
substituted  for  the  other  without  altering  the 
meaning;  “snake”  and  “serpent.”  (N,  V,  Adj) 

Antonymy: Two word forms are direct antonyms
if one is the conventional opposite of the other;
“clean” and “dirty.” (N, V, Adj)

Hyponymy/Hypernymy: Forms expressing concept A are hyponyms
(subordinates, subsets) of forms expressing concept B if A is
included in B. If F_A is a hyponym of F_B, then F_B is a
hypernym (superordinate, superset) of F_A; “A house is a
(kind of) building.” (N)

Troponymy: Forms expressing concept A are troponyms of forms
expressing concept B if A is a particular manner of doing B;
“To march is to walk in a particular manner.” The reciprocal
relation is also coded in the database, but is called
simply “superordinate.” (V)

Meronymy/Holonymy: Forms expressing concept A are meronyms of
forms expressing concept B if A is a part of B. If F_A is a
meronym of F_B, then F_B is a holonym of F_A. Three types of
part relations are coded: (1) member (“The navigator is part of
the crew”); (2) material (“The paper is part of the page”);
(3) component (“The wing is part of the plane”). When the
meronym type was uncertain it was coded as a component part. (N)

Entailment: Forms expressing concept A entail forms expressing
concept B if the occurrence of B is necessary for the
occurrence of A, and F_A and F_B are not related by troponymy;
“To fail entails trying.” (V)

Cause:  A  special  case  of  entailment;  “To  kill  is 
to  cause  to  die.”  (V) 

All of these semantic relations hold between words or concepts
in the same syntactic category. Two additional semantic
relations, “is an attribute of” and “is a function of,” have
not yet been coded. Both require pointers between syntactic
categories: between adjectives and nouns in the case of
attributes, and between verbs and nouns in the case of
functions. It is believed that these relations can be added,
and that the result will be a better simulation of lexical
memory and a more useful database for practical applications.

Although the relations listed above suffice to account for most
common word associations, at least one important feature of
lexical memory is not captured by a purely relational approach,
namely, differences in the familiarity of different words.
Although frequency of occurrence is the preferred measure of
familiarity, counts broken down by part of speech are not
presently available for all of the words included in this
database. So an alternative measure was adopted. In general,
the more familiar a word is, the more alternative senses it
has, so a sense count was made for an on-line dictionary; the
results are included in the database for each word by syntactic
category.

Finally, since selectional restrictions (the restrictions on
noun phrases that can serve as cases, or arguments, of a verb)
are so important for syntax, the database includes 33 different
sentence frames indicating the admissible syntactic structures
for each sense of every verb.

Implementation

In order to realize a computer simulation of this lexical
system, it was necessary to have a computer representation for
lexicalized concepts as well as lexical forms. The following
assumption, therefore, is basic to the implementation: a
lexicalized concept can be represented by a set of word forms
that can express that concept when used in appropriate
contexts. For example, the set {case, lawsuit} would represent
a different meaning of “case” than would {case, box, carton}
or {case, patient}. Such sets of words are called synonym sets
or, briefly, synsets. Of course, a computer that is given a
synset does not “understand” anything, but a human who knows
the language will recognize the intended meaning. But the
computer should be able to process a synset in a manner
analogous to the way people process the corresponding concept.
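
The synset idea can be illustrated with a small sketch. It is written in Python for brevity and is not the project's own code (which is in C and C++); the names SYNSETS, build_form_index, and senses are hypothetical, and only the three example sets come from the text.

```python
# Hypothetical sketch: a lexicalized concept represented as a set of word forms.
from collections import defaultdict

# Three synsets that all contain the form "case" (the example from the text).
SYNSETS = [
    frozenset({"case", "lawsuit"}),
    frozenset({"case", "box", "carton"}),
    frozenset({"case", "patient"}),
]


def build_form_index(synsets):
    """Map each word form to the list of synsets (senses) it appears in."""
    index = defaultdict(list)
    for synset in synsets:
        for form in synset:
            index[form].append(synset)
    return index


def senses(form, index):
    """Return every synset containing `form`, or an empty list if unknown."""
    return index.get(form, [])


INDEX = build_form_index(SYNSETS)
```

Under these assumptions, looking up "case" yields three synsets, one per sense, while "carton" yields only the container sense; the other members of each set are what tell a human reader which sense is meant.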

As work progressed, however, it was discovered that synonyms
are not always available to signal conceptual differences
between synsets. Therefore, the standard lexicographic method
of adding a defining gloss was adopted to clarify the intended
distinctions. Since this resort to definitions came relatively
late, they are available for only about 30% of the synsets.
They are coded parenthetically and can be either displayed or
suppressed by the interface.

Given this coding for synonymy, other semantic relations can be
coded either by pointers between word forms or by pointers
between synsets. For example, the fact that “war” is an
antonym of “peace” is coded [war !→ peace], and the fact that
tennis is a kind of court game is coded
{tennis, lawn_tennis} @→ {court_game}. These semantic relations
are entered by lexical coders; the reciprocal relations are
then added automatically by a program known as the “grinder,”
which converts lexical files into a lexical database.

Software developed in order to implement this system is written
in C and C++ and includes the following components:

Editor: These programs support the work of entering information
into the lexical files. To supplement the editor, there are
programs to search and display the contents of on-line
dictionaries, to verify the syntax of the lexical files, to
recast a noun file in the form of an outline, and to provide an
archive to keep track of the files as they are edited and
updated.


Grinder:  This  large  program  turns  the  lexical 
files  into  a  database.  It  first  checks  for  coding 
errors  and  requests  corrections.  Then  it  inserts  all 
of  the  reciprocal  semantic  relations  that  coders 
omit,  and  outputs  the  result  as  a  coherent  database 
with  a  unique  identifier  for  every  synset.  Finally, 
it  constructs  an  index  of  the  letter  strings,  listing 
all  of  the  synsets  in  which  each  string  appears. 
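
The grinder's step of inserting reciprocal relations might be sketched as follows. This is an illustration in Python, not the grinder itself; the relation names, the RECIPROCAL table, and the (source, relation, target) triple format are assumptions made for the example.

```python
# Hypothetical sketch of reciprocal-pointer insertion; not the actual grinder.

# Assumed pairing of each relation with its reciprocal.
RECIPROCAL = {
    "antonym": "antonym",
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "meronym": "holonym",
    "holonym": "meronym",
}


def add_reciprocals(pointers):
    """Return the coder-entered pointers plus any missing reciprocal pointers.

    Each pointer is a (source, relation, target) triple.
    """
    completed = set(pointers)
    for source, relation, target in pointers:
        inverse = RECIPROCAL.get(relation)
        if inverse is not None:
            completed.add((target, inverse, source))
    return completed
```

Given the two pointers used as examples earlier (war to peace as an antonym, tennis to court_game as a hypernym), the sketch would add peace to war as an antonym and court_game to tennis as a hyponym, so that the network can be explored in either direction.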

Search routines: A set of routines accepts requests as input
and returns information retrieved from the database. A request
consists of a letter string and an identifier for the kind of
semantic relation that is desired.

Morphology: The WordNet database contains primarily canonical
word forms. That is to say, it contains information about the
singular “tree” but not about the plural “trees,” about present
tense “hurl” but not past tense “hurled,” etc. For practical
applications, therefore, it is necessary to have a morphology
program that will transform these inflected forms into the
canonical forms contained in the database. This program is
fairly conventional. It contains an extensive list of
exceptions, that is, words that do not follow the rules of
English morphology. If a requested character string is on this
list, its canonical form will be used to search the database.
If a character string is not on the exception list and is not
in the database, the program will attempt to strip inflections
from it in order to arrive at a string that can be found in the
database. Only if these attempts fail will the program report
that the string is not in the database.
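
The lookup order described above (exception list first, then the database, then inflection stripping) can be sketched as below. The sketch is in Python and is only illustrative: the tiny EXCEPTIONS and DATABASE tables and the SUFFIX_RULES list are assumptions, not the rules of the actual morphology program.

```python
# Hypothetical sketch of the morphology lookup order; the data are made up.

EXCEPTIONS = {"went": "go", "threw": "throw", "thrown": "throw"}
DATABASE = {"go", "throw", "hurl", "tree", "fight"}

# (suffix, replacement) pairs tried in order; an assumed, incomplete rule set.
SUFFIX_RULES = [
    ("ies", "y"),
    ("es", ""),
    ("s", ""),
    ("ed", ""),
    ("ed", "e"),
    ("ing", ""),
    ("ing", "e"),
]


def canonical_form(string):
    """Return the canonical form of `string`, or None if it cannot be found."""
    if string in EXCEPTIONS:
        return EXCEPTIONS[string]
    if string in DATABASE:
        return string
    for suffix, replacement in SUFFIX_RULES:
        if string.endswith(suffix):
            candidate = string[: len(string) - len(suffix)] + replacement
            if candidate in DATABASE:
                return candidate
    return None
```

With these tables, "went" resolves through the exception list, "trees" and "hurled" resolve by suffix stripping, and an unknown string is reported as not found only after every rule has been tried.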

Combined with search routines, this morphology program takes
inflected inputs and returns canonical outputs, e.g., a request
for synonyms of “hurled” will elicit “throw.” A more
sophisticated morphology program that will return inflected
outputs (one that will give “threw” or “thrown” as synonyms of
“hurled”) is under development as part of the lexical filter
application described below.

Interface: Several interfaces have been created to display
information that is retrieved for the user. The simplest is a
command-line version that can be used on any monitor. A more
elaborate interface, using SunView (a windowing system owned by
Sun Microsystems, Inc.) was used for systems development. And
an interface using the X-11 window system was developed for
general distribution with the database. These interfaces are
described in more detail in the section on Applications, below.

Man pages: For Unix systems, a set of man pages is available. A
user should look first at wnintro(1), which gives an overview
of the man pages in chapter 1 of the manual. They include
nverify(1) to describe a program that checks the syntax of
lexical files, grind(1) to describe operation of the grinder,
wntool(1) for the SunView interface, xwn(1) for the X-11
interface, and wn(1) for the command-line interface. There is
also wnintro(5), which introduces wninput(5) for the syntax of
the lexical input files and wndb(5) for the syntax of the
database itself.

Coverage 

The goal for WordNet was to include approximately the same
vocabulary that one expects to find in a collegiate dictionary.
Because the format is so different from a printed dictionary,
however, numerical comparisons cannot be made directly. Three
different numbers are needed to characterize the size of
WordNet: (1) the number of character strings (ASCII strings);
(2) the number of synsets; and (3) the number of unique
string-synset combinations. (If the same string occurs in five
synsets, it counts as one string but five unique string-synset
combinations, i.e., each distinct sense of a string is
considered to be a different word.) These numbers, broken down
by syntactic category, are given in the following table, where
the unique string-synset combinations are referred to simply as
“Words.”


Category      Strings   Synsets    Words
Nouns          36,114    28,276   48,672
Verbs           9,699     6,087   15,824
Adjectives     12,283    10,620   23,912
Total          58,096    44,983   88,408
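
The three size measures can be made concrete with a short sketch. It is in Python and purely illustrative; the function name size_counts and the list-of-lists input format are assumptions, and the sample data are the “case” synsets used earlier rather than anything drawn from the database.

```python
# Hypothetical sketch: computing the three size measures for one category.

def size_counts(synsets):
    """Return (strings, synsets, words) for a list of synsets.

    strings: distinct character strings across all synsets
    synsets: number of synsets
    words:   unique string-synset combinations
    """
    strings = set()
    words = 0
    for forms in synsets:
        unique_forms = set(forms)
        strings.update(unique_forms)
        words += len(unique_forms)
    return len(strings), len(synsets), words


EXAMPLE = [["case", "lawsuit"], ["case", "box", "carton"], ["case", "patient"]]
```

For the example, "case" counts once as a string but three times as a word, which is why the Words column in the table is larger than the Strings column.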

Much  of  the  work  of  creating  WordNet, 
however,  consisted  of  inserting  pointers  between 
synsets  to  represent  semantic  relations  between 
concepts,  and  the  novelty  and  utility  of  the  system 
depends  on  these  relations.  The  total  numbers  of 
pointers  for  the  various  semantic  relations  coded 
in  the  database  are  shown  in  the  following  table. 

Category      Pointers   Definitions
Nouns           40,087         7,164
Verbs           10,771         2,562
Adjectives      13,854         3,962
Total           64,712        13,688

This table also gives the number of synsets in each syntactic
category that have an accompanying parenthetical defining
phrase.

Applications

Although WordNet was initially intended as an experiment, its
success will be tested by the usefulness of the resulting
database. The WordNet database is available for general use in
natural language processing and is expected to enrich the
content of a variety of practical applications. Three examples
were developed under this contract, two of which (a command
line interface and a browser) were required in order to develop
the database, and one (a lexical filter) is intended to assist
writers.

Command line: The simplest interface requires a user to tag the
request for information about a word with an indication as to
what information is requested. This interface can deal with
inflectional morphology. For example, the command line:

wn went -synsv

returns all synsets for the verb “go.” The command with three
tags:

wn fights -synsn -synsv -synsa

will elicit a report for all synsets of “fight” (in this case,
as a noun and verb, but not as an adjective). The wn command
without arguments is a request for help: it produces a list of
all the available tags. Definitional glosses will not be shown
unless the tag -d is inserted immediately following the target
word.

Although the command-line interface is simple, some of the
commands are relatively complex. For example, the tag -palln
will not only return the parts that are directly coded as parts
of the searchword, but will also list all of the parts that the
searchword inherits from its hypernyms.
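
The inheritance of parts from hypernyms can be sketched as a walk up the hypernym chain. The sketch is in Python and is not the wn search code; the HYPERNYM and PARTS tables are invented for illustration, and it assumes each concept has at most one hypernym.

```python
# Hypothetical sketch of collecting direct and inherited parts; data are made up.

HYPERNYM = {"jet": "airplane", "airplane": "aircraft", "aircraft": "vehicle"}
PARTS = {
    "jet": ["jet_engine"],
    "airplane": ["wing", "fuselage"],
    "aircraft": ["cockpit"],
}


def all_parts(word):
    """Return (holder, part) pairs: direct parts first, then inherited ones."""
    results = []
    seen = set()
    current = word
    # Climb the hypernym chain, guarding against accidental cycles.
    while current is not None and current not in seen:
        seen.add(current)
        for part in PARTS.get(current, []):
            results.append((current, part))
        current = HYPERNYM.get(current)
    return results
```

In this toy network a request for all parts of "jet" returns its own part and then the parts coded for "airplane" and "aircraft", each labeled with the concept from which it is inherited.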

Browser: The interface used for developing WordNet was called
“lexpert” or “browser.” Initially, it was a window in the
SunView window system; subsequently it was rewritten as an X-11
window. A target word can be typed or dragged to the input slot
to start a search. If the word is found in the database,
buttons appear indicating that WordNet knows about the word as
a noun, or a verb, or an adjective, or some combination. The
mouse can then be used to expose a menu that lists all of the
kinds of information available about that word. The same
searches are available with the browser that are available with
the command-line interface, but commands that will not yield
information are “greyed out” on the menu. By selecting from the
menu, a user can pursue the particular semantic relation of
interest. For nouns, the user may have a choice among synonyms,
antonyms, hyponyms, hypernyms, or meronyms, or may ask about
the word’s familiarity. For verbs, the user may select from
synonyms, antonyms, superordinates, troponyms, entailments,
cause, familiarity, or sentence frames. For adjectives, the
user may select synonyms, antonyms, or familiarity. When this
interface is used to write lexical files, it is used in
conjunction with on-line dictionaries. Thus it becomes possible
to search the dictionary conceptually, not merely
alphabetically.

Since inflections are stripped from input requests, the browser
can also be used while composing a text file: words in the text
can be highlighted with the cursor and dragged to WordNet. The
third interface was an attempt to capitalize on this feature.

Filter: The filter program is an attempt to use WordNet as part
of a writer’s assistant. It is not interactive. It takes a text
file as input and goes through it word by word. If a word in
the text is not found in WordNet, it is added to a list in a
file of “unknown words.” Experience with the lexical filter has
shown that many of the unknown words are proper nouns, some are
typographical mistakes, but some are words that clearly should
be added to the WordNet database. If a word in the text is
found in WordNet, its familiarity is tested; if it is familiar,
the filter does nothing, but if it is unfamiliar, the filter
prints out all of the synsets in which the word occurs,
accompanying each word with its familiarity value. That is to
say, an author is not only told that a word is unfamiliar; an
attempt is made to suggest more familiar alternatives.
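
The filter's decision procedure can be sketched as follows. This is a Python illustration, not the X-11 filter itself: the FAMILIARITY values, the THRESHOLD cutoff, and the SYNSETS_BY_WORD table are all assumptions, and the sketch omits the inflection stripping that the real filter would apply first.

```python
# Hypothetical sketch of the lexical filter; all data and the cutoff are made up.
import re

# Familiarity stands in for a sense count: higher means more familiar.
FAMILIARITY = {"a": 7, "may": 5, "bite": 6, "snake": 4, "serpent": 1}
SYNSETS_BY_WORD = {
    "snake": [["snake", "serpent"]],
    "serpent": [["snake", "serpent"]],
}
THRESHOLD = 3  # assumed cutoff below which a word counts as unfamiliar


def filter_text(text):
    """Return (unknown_words, suggestions) for a piece of text.

    suggestions maps each unfamiliar word to its synsets, with every
    member of a synset paired with its familiarity value.
    """
    unknown = []
    suggestions = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in FAMILIARITY:
            if word not in unknown:
                unknown.append(word)
        elif FAMILIARITY[word] < THRESHOLD:
            suggestions[word] = [
                [(form, FAMILIARITY.get(form, 0)) for form in synset]
                for synset in SYNSETS_BY_WORD.get(word, [])
            ]
    return unknown, suggestions
```

For the input "A serpent may bite Zork", the sketch lists "zork" as unknown and flags "serpent" as unfamiliar, showing the more familiar "snake" alongside it with its familiarity value.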

In its present form, the filter frequently suggests
alternatives that are inappropriate. For example, they may be
for the wrong part of speech. More often, even when they are in
the correct syntactic category, they include other senses of
the word. Since the filter responds to unfamiliar words and
unfamiliar words are seldom ambiguous, these problems are not
severe. But a simple parser (or “parts” program) that could use
the context in order to discriminate among nouns, verbs, and
adjectives would eliminate syntactic confusions. A more
intelligent system would be required to eliminate semantic
ambiguity. For example, the text-critiquing program being
developed by David Kieras at the University of Michigan is one
such intelligent system for assisting writers; Kieras is
exploring the use of the semantic information in WordNet to
enhance the capabilities of that system. Other opportunities to
evaluate WordNet in a testbed provided by a language
understanding system are under discussion.

Preliminary results thus confirm the commonsense conclusion
that WordNet is best used in conjunction with other components
as one part of a more powerful system for natural language
processing. The fact that such marriages are possible, however,
indicates that WordNet does provide an effective combination of
traditional lexicographic information with modern computer
technology.

Availability 

Copyright to WordNet is held by Princeton University in order
to protect the rights of the developers to use their own work
and make it available to others, and an application is being
filed to protect the term “WordNet.” However, an early version
has been running on computers at NPRDC, and the database,
search code, morphology routines, interface, and man pages (a
7-Mbyte package, WordNet 1.0) are available for public
distribution. Inquiries addressed to wordnet@princeton.edu
should elicit information about how to obtain these materials
via ftp; it is hoped that the Lexical Consortium at New Mexico
State University will distribute these materials. If demand
justifies it, the package can be made available on a CD-ROM
disk.

Contributors 

The following persons, listed in alphabetical order, worked on
WordNet prior to July 1991: Amalia Bachman, Richard Beckwith,
Marie Bienkowski, Patrick Byrne, Roger Chaffin, George Collier,
Michael Colon, Melanie Cook, Christiane Fellbaum, Derek Gross,
Brian Gustafson, Philip N. Johnson-Laird, Judith Kegl, Benjamin
O. Martin, Elana Messer, George A. Miller, Katherine J. Miller,
Antonio Romero, Daniel A. Teibel, Randee Tengi, Anton J.
Vishio, Pamela Wakefield.